ADM: Market Basket Analysis / Association Rules
Why does Association Analysis (AA) matter in data mining?
- Market basket data analysis, cross-marketing, catalog design, sales campaign analysis
- Web log (click stream) analysis, DNA sequence analysis, etc.
Association Analysis (AA)
- Finds relationships (links) between variables across the set of records in the data.
- These links are called associations (ASSOCIATION).
Three types of association problems:
- Association discovery (unordered; the topic of this lecture)
- Sequential pattern discovery (ordered; not covered)
- Similar time discovery (uses time information, e.g. log analysis)
Association Rules ~ Market Basket Analysis
Image source: https://www.kdnuggets.com/2018/07/minimum-viable-data-product.html
Image source: https://www.youtube.com/watch?v=VZL6uhA8XKg
Association Rules (AR) in one paragraph
AR tries to find all sets of ITEMS (ITEMSETS) whose SUPPORT is greater than a MINIMUM SUPPORT, then uses these significant itemsets to generate RULES whose CONFIDENCE is greater than a MINIMUM CONFIDENCE. The rules are then judged valuable (significant) according to their LIFT. The most popular application of AR is Market Basket Analysis (MBA).
Items and Itemsets
- AR data takes the form of "transactions": a collection of itemsets, each of which is a set of items
- Items: Bread, Milk, Coke, etc.
- Itemset: {Bread, Milk}
- Example transactions on one day in a store:
| TID | Items |
|---|---|
| 1 | Bread, Milk |
| 2 | Bread, Diaper, Beer, Eggs |
| 3 | Milk, Diaper, Beer, Coke |
| 4 | Bread, Milk, Diaper, Beer |
| 5 | Bread, Milk, Diaper, Coke |
Formally (Summary of AR Theory)
- An item is an element of the data, e.g. Milk, Bread, Eggs.
- An itemset is any subset that can be formed from the items, e.g. {Milk, Bread, Eggs} or {Milk, Eggs}.
- The frequency with which an item or itemset occurs in the data is called its support.
- If the support is greater than a threshold, the itemset is called a frequent itemset.
- A rule has the form X → Y, where X (the antecedent) and Y (the consequent) are itemsets. Example:
- {Milk, Diaper} → {Beer}
- The support of a rule is the number (or fraction) of transactions that contain both X and Y:
- s(X → Y) = s(X ∪ Y)
- In association rule mining we look for rules with significant support and confidence.
- The ratio of a rule's confidence to its expected (unconditional) confidence is called the "lift":
- Lift < 1 is regarded as "negative" (less than expected)
- Lift = 1: neutral
- Lift > 1 is regarded as "positive" (more than expected)
- Reference for "lift": S. Brin, R. Motwani, J. D. Ullman, and S. Tsur. Dynamic itemset counting and implication rules for market basket data.
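The definitions of support and frequent itemset can be checked by hand on the five example transactions. The following is a minimal sketch in plain Python; the helper names `support` and `frequent_itemsets`, the threshold 0.6, the size limit of 2, and the use of `>=` rather than a strict `>` are illustrative assumptions, not part of the lecture material.

```python
from itertools import combinations

# The five example transactions (one set of items per transaction).
transactions = [
    {"Bread", "Milk"},
    {"Bread", "Diaper", "Beer", "Eggs"},
    {"Milk", "Diaper", "Beer", "Coke"},
    {"Bread", "Milk", "Diaper", "Beer"},
    {"Bread", "Milk", "Diaper", "Coke"},
]

def support(itemset, transactions):
    """Fraction of transactions that contain every item of `itemset`."""
    itemset = set(itemset)
    hits = sum(1 for t in transactions if itemset <= t)
    return hits / len(transactions)

def frequent_itemsets(transactions, min_support, max_size=2):
    """Brute-force search for itemsets with support >= min_support."""
    items = sorted(set().union(*transactions))
    result = {}
    for k in range(1, max_size + 1):
        for combo in combinations(items, k):
            s = support(combo, transactions)
            if s >= min_support:
                result[frozenset(combo)] = s
    return result

freq = frequent_itemsets(transactions, min_support=0.6)
# Print the frequent itemsets, highest support first.
for itemset, s in sorted(freq.items(), key=lambda kv: (-kv[1], sorted(kv[0]))):
    print(sorted(itemset), s)
```

Brute-force enumeration is only workable for a handful of items; the algorithms later in these notes exist precisely to avoid it.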
Example rule:
Instant noodles ==> Chili sauce
Rules are used in marketing to support many decisions, for example:
- Place the two items close together (so shoppers do not forget to buy both).
- Place the two items far apart (so shoppers browse other goods on the way).
- Bundle the two items in a promotion (the promotion is more attractive because shoppers need both anyway).
- Bundle the two items with another, slow-moving item (cross-selling).
- Raise the price of one item and lower the price of the other (a way to compete with the "shop next door").
- Do not advertise the two items together.
- Offer a free sachet of sauce with every purchase of premium instant noodles.
Rule, Support, Confidence, Lift by Example
Image source: http://www.saedsayad.com/association_rules.htm
Support
The support of the rule A ==> B is the probability that A and B occur together: $$ Support(A \Rightarrow B) = \frac{|A \cap B|}{|T|} $$ where $|A\cap B|$ is the number of transactions that contain both products A and B, and $|T|$ is the total number of transactions.
Confidence
The confidence of the rule A => B is the conditional probability of B given A. In AR it is computed as: $$ Confidence(A \Rightarrow B) = \frac{|A\cap B|}{|A|}$$ where $|A|$ is the number of transactions that contain A.
Lift
The lift of the rule A => B measures how much more often A and B occur together than they would if they were statistically independent. If A and B are independent then Lift(A=>B) = 1. If the lift is greater than 1, A and B are said to be positively correlated; if it is less than 1, negatively correlated. Lift(A=>B) is computed as: $$ lift(A \Rightarrow B)=\frac{confidence(A \Rightarrow B)}{P(B)}=\frac{P(A\cap B)}{P(A)P(B)}$$ Note that lift is symmetric: Lift(A=>B) = Lift(B=>A)
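As a worked check of support, confidence, and lift, the sketch below evaluates the rule {Milk, Diaper} => {Beer} on the five example transactions from earlier in these notes. The function names are illustrative assumptions. Confidence is computed as support(A ∪ B) / support(A), and exact fractions are used so that the symmetry of lift can be compared without rounding.

```python
from fractions import Fraction

transactions = [
    {"Bread", "Milk"},
    {"Bread", "Diaper", "Beer", "Eggs"},
    {"Milk", "Diaper", "Beer", "Coke"},
    {"Bread", "Milk", "Diaper", "Beer"},
    {"Bread", "Milk", "Diaper", "Coke"},
]

def support(itemset):
    """Exact fraction of transactions containing every item of `itemset`."""
    hits = sum(1 for t in transactions if set(itemset) <= t)
    return Fraction(hits, len(transactions))

def confidence(antecedent, consequent):
    """Conditional probability of the consequent given the antecedent."""
    return support(set(antecedent) | set(consequent)) / support(antecedent)

def lift(antecedent, consequent):
    """Confidence divided by the unconditional support of the consequent."""
    return confidence(antecedent, consequent) / support(consequent)

A, B = {"Milk", "Diaper"}, {"Beer"}
print("support   :", support(A | B))      # 2/5
print("confidence:", confidence(A, B))    # 2/3
print("lift      :", lift(A, B))          # 10/9
print("symmetric :", lift(A, B) == lift(B, A))
```

A lift of 10/9 is only slightly above 1, so on this tiny data set the rule is barely better than what independence would predict.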
Leverage
Leverage is similar to lift, except that it measures the difference (rather than the ratio, as lift does) between the frequency with which A and B occur together and the frequency expected if A and B were independent. A leverage of 0 indicates that A and B are independent. Leverage is computed as: $$ Leverage(A \Rightarrow B)= Support(A \Rightarrow B) - Support(A) \times Support(B)$$
Reference for leverage: Piatetsky-Shapiro, G., Discovery, analysis, and presentation of strong rules. Knowledge Discovery in Databases, 1991: p. 229-248.
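To make the sign of leverage concrete, here is a small sketch on the five example transactions from earlier in these notes. The helper names are assumptions for illustration; exact fractions are used so the results can be read directly.

```python
from fractions import Fraction

transactions = [
    {"Bread", "Milk"},
    {"Bread", "Diaper", "Beer", "Eggs"},
    {"Milk", "Diaper", "Beer", "Coke"},
    {"Bread", "Milk", "Diaper", "Beer"},
    {"Bread", "Milk", "Diaper", "Coke"},
]

def support(itemset):
    """Exact fraction of transactions containing every item of `itemset`."""
    hits = sum(1 for t in transactions if set(itemset) <= t)
    return Fraction(hits, len(transactions))

def leverage(antecedent, consequent):
    """Observed joint support minus the joint support expected under independence."""
    joint = support(set(antecedent) | set(consequent))
    return joint - support(antecedent) * support(consequent)

# Occur together slightly more often than independence predicts.
print(leverage({"Milk", "Diaper"}, {"Beer"}))   # 1/25
# Occur together less often than independence predicts.
print(leverage({"Bread"}, {"Beer"}))            # -2/25
```

Because leverage is a difference of supports, it is bounded by the supports involved, whereas lift can become very large for rare itemsets.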
Summary
All of the measures above are neatly summarized as follows:
Image source: http://rasbt.github.io/mlxtend/user_guide/frequent_patterns/association_rules/
Rule Selection
In practice AR produces a very large number of rules from the data, and not all of them will be used for decision making. To reduce the number of rules, it helps to first group (similar) items into categories. Rules that meet the following criteria are then selected (why? Please discuss in the Forum):
- Rules with large and small lift.
- Items that occur most (and least) frequently.
Apriori Principle (anti-monotone property)
If an itemset is frequent, then all of its subsets must also be frequent. Equivalently (the contrapositive), if an itemset is infrequent, then all of its supersets must also be infrequent. Formally, $$ \forall A, B : (A\subset B) \Rightarrow s(A) \geq s(B) $$ In other words, the support of an itemset never exceeds the support of any of its subsets. This property is essential for reducing the computational complexity of mining rules from data.
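The anti-monotone property can be verified exhaustively on a small data set. The sketch below checks every itemset against each of its immediate subsets on the five example transactions from earlier in these notes; the variable names are illustrative assumptions.

```python
from itertools import combinations

transactions = [
    {"Bread", "Milk"},
    {"Bread", "Diaper", "Beer", "Eggs"},
    {"Milk", "Diaper", "Beer", "Coke"},
    {"Bread", "Milk", "Diaper", "Beer"},
    {"Bread", "Milk", "Diaper", "Coke"},
]

def support(itemset):
    """Fraction of transactions containing every item of `itemset`."""
    return sum(1 for t in transactions if set(itemset) <= t) / len(transactions)

items = sorted(set().union(*transactions))
violations = []
for k in range(2, len(items) + 1):
    for superset in combinations(items, k):
        # Compare with every subset obtained by dropping one item.
        for subset in combinations(superset, k - 1):
            if support(subset) < support(superset):
                violations.append((subset, superset))

print("violations:", violations)  # the list stays empty

# Consequence used for pruning: {Eggs} is infrequent at min_support = 0.6,
# so no itemset containing Eggs needs to be counted at all.
eggs_supersets = [c for k in range(2, 4) for c in combinations(items, k) if "Eggs" in c]
print(all(support(c) <= support({"Eggs"}) for c in eggs_supersets))
```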
Association Rule Algorithms:
Although the theory of AR is fairly simple, there are quite a few AR algorithms, among them AIS, Apriori, SETM, AprioriTid, and AprioriHybrid. Most of them differ in how they try to reduce computation. Here we only briefly discuss the AIS and Apriori algorithms.
AIS Algorithm:
- Candidate itemsets are generated and counted on the fly as the data is scanned.
- For each transaction, the algorithm determines which of the large itemsets found so far are contained in that transaction.
- New candidate itemsets are generated by extending these large itemsets with other items in the same transaction.
- The drawback of AIS is that it generates too many candidate itemsets that turn out to be small (infrequent).
- The details are shown in the following figure:
Illustration of the AIS algorithm and the use of minimum support to reduce computation.
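The AIS steps described above can be sketched as follows. This is a simplified illustration under stated assumptions, not a faithful reimplementation of the published algorithm: itemsets are kept as sorted tuples, a large itemset is extended only with items that sort after its last item (so each candidate is produced once per transaction), and the function name `ais_sketch` and the absolute threshold `min_count` are hypothetical.

```python
from collections import Counter

def ais_sketch(transactions, min_count):
    """Return (large itemsets with counts, number of candidates generated)."""
    transactions = [sorted(set(t)) for t in transactions]

    # Pass 1: count single items.
    counts = Counter((item,) for t in transactions for item in t)
    generated = len(counts)
    large = {c: n for c, n in counts.items() if n >= min_count}
    all_large = dict(large)

    # Later passes: extend large itemsets found in each transaction.
    while large:
        counts = Counter()
        for t in transactions:
            present = set(t)
            for itemset in large:
                if present.issuperset(itemset):
                    for item in t:
                        if item > itemset[-1]:
                            counts[itemset + (item,)] += 1
        generated += len(counts)
        large = {c: n for c, n in counts.items() if n >= min_count}
        all_large.update(large)
    return all_large, generated

transactions = [
    {"Bread", "Milk"},
    {"Bread", "Diaper", "Beer", "Eggs"},
    {"Milk", "Diaper", "Beer", "Coke"},
    {"Bread", "Milk", "Diaper", "Beer"},
    {"Bread", "Milk", "Diaper", "Coke"},
]
large_itemsets, n_candidates = ais_sketch(transactions, min_count=3)
print(len(large_itemsets), "large itemsets from", n_candidates, "candidates")
```

Candidates such as (Beer, Eggs) are created and counted even though Eggs is already known to be infrequent, which illustrates the drawback noted in the list.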
Apriori Algorithm
- Candidate itemsets are generated using only the large itemsets of the previous pass without considering the transactions in the database.
- The large itemset of the previous pass is joined with itself to generate all itemsets whose size is higher by 1.
- Each generated itemset that has a subset which is not large is deleted. The remaining itemsets are the candidate ones.
Image Source: http://www.saedsayad.com/association_rules.htm
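The join and prune steps listed above can be sketched in a few lines. This is a minimal, unoptimized illustration under stated assumptions: the names `apriori_gen` and `apriori_sketch` and the absolute threshold `min_count` are chosen for this example, and the code is not the implementation used by mlxtend or Orange.

```python
from itertools import combinations

def apriori_gen(prev_large):
    """Build size-k candidates from the large itemsets of size k-1."""
    prev_large = set(prev_large)
    candidates = set()
    for a in prev_large:
        for b in prev_large:
            union = a | b
            if len(union) == len(a) + 1:  # join: a and b differ in one item
                # prune: every (k-1)-subset must itself be large
                if all(frozenset(s) in prev_large
                       for s in combinations(union, len(a))):
                    candidates.add(union)
    return candidates

def apriori_sketch(transactions, min_count):
    """Return a dict mapping each large itemset to its transaction count."""
    transactions = [frozenset(t) for t in transactions]
    candidates = {frozenset([i]) for i in set().union(*transactions)}
    large = {}
    while candidates:
        counts = {c: sum(1 for t in transactions if c <= t) for c in candidates}
        current = {c: n for c, n in counts.items() if n >= min_count}
        large.update(current)
        candidates = apriori_gen(current)  # the database is not consulted here
    return large

transactions = [
    {"Bread", "Milk"},
    {"Bread", "Diaper", "Beer", "Eggs"},
    {"Milk", "Diaper", "Beer", "Coke"},
    {"Bread", "Milk", "Diaper", "Beer"},
    {"Bread", "Milk", "Diaper", "Coke"},
]
large = apriori_sketch(transactions, min_count=3)
for itemset, n in sorted(large.items(), key=lambda kv: (len(kv[0]), sorted(kv[0]))):
    print(sorted(itemset), n)
```

With a minimum count of 3, only one 3-item candidate, {Bread, Diaper, Milk}, survives the prune step, and it is then rejected when its count is checked against the data.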
Other Algorithms:
- SETM Algorithm
- AprioriTid Algorithm
- AprioriHybrid Algorithm
- etc.
Discussion:
- A store carries too many kinds of items ==> how to deal with it?
- Is AR inferential? How often should rules be regenerated?
References:
- [1]: J. Han, J. Pei, Y. Yin, R. Mao. Mining Frequent Patterns without Candidate Generation: A Frequent-Pattern Tree Approach. 2004. https://www.cs.sfu.ca/~jpei/publications/dami03_fpgrowth.pdf
- [2]: R. Agrawal, C. Aggarwal, V. Prasad. Depth first generation of long patterns. 2000. http://www.cs.tau.ac.il/~fiat/dmsem03/Depth%20First%20Generation%20of%20Long%20Patterns%20-%202000.pdf
- [3]: R. Agrawal, et al. Fast Discovery of Association Rules. 1996. http://cs-people.bu.edu/evimaria/cs565/advances.pdf
Some recommendation-model modules in Python:
- Crab (discontinued).
- Surprise
- Python Recsys (discontinued/very old)
- MRec (discontinued)
- mlxtend (very limited documentation)
- PyCaret: https://pycaret.readthedocs.io/en/latest/api/arules.html
We will also try Orange: a Python GUI for data mining.
Image source: http://gp.mx.tl/oranged-net-software
Add-ons in Orange
- Installing the "Associate" add-on in Orange
- pip install orange3
- python -m Orange.canvas (create a shortcut with this command)
- Install the add-on
- http://orange3-associate.readthedocs.org/
CSV/XLS(X) data files in Orange (three-header format)
[1]. Feature names (variable names) on the first line.
[2]. Feature types on the second line. The type is determined automatically, or, if set, can be any of the following:
- discrete (or d): imported as Orange.data.DiscreteVariable,
- a space-separated list of discrete values, like "male female", which will result in Orange.data.DiscreteVariable with those values and in that order. If the individual values contain a space character, it needs to be escaped (prefixed) with, as common, a backslash ('\') character.
- continuous (or c): imported as Orange.data.ContinuousVariable,
- string (or s, or text): imported as Orange.data.StringVariable,
- time (or t): imported as Orange.data.TimeVariable, if the values parse as ISO 8601 date/time formats,
- basket: used for storing sparse data. More on basket formats in a dedicated section.
[3]. Flags (optional) on the third header line. A feature's flag can be empty, or it can contain, space-separated, a consistent combination of:
- class (or c): feature will be imported as a class variable. Most algorithms expect a single class variable.
- meta (or m): feature will be imported as a meta-attribute, just describing the data instance but not actually used for learning,
- weight (or w): the feature marks the weight of examples (in algorithms that support weighted examples),
- ignore (or i): feature will not be imported,
- <key>=<value> custom attributes.
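As an illustration of the three header lines, the sketch below builds a small tab-separated file in memory using only the standard library. The column names and data values are invented for this example, and the choice of a tab delimiter is an assumption; adapt it to the file type you actually load in Orange.

```python
import csv
import io

rows = [
    ["age", "gender", "name", "bought_beer"],              # line 1: feature names
    ["continuous", "male female", "string", "discrete"],   # line 2: feature types
    ["", "", "meta", "class"],                             # line 3: flags (may be empty)
    ["34", "male", "Andi", "yes"],                         # data rows start on line 4
    ["27", "female", "Sari", "no"],
]

buffer = io.StringIO()
csv.writer(buffer, delimiter="\t", lineterminator="\n").writerows(rows)
print(buffer.getvalue())
```

Here `gender` lists its discrete values explicitly, `name` is kept only as a meta-attribute, and `bought_beer` is marked as the class variable.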
Example in Orange
- Example taken from https://blog.biolab.si/2016/04/25/association-rules-in-orange/
- Input data: "FoodMart 2000 Dataset"
- Drag the "DataSet" node (Open) ==> "Send Data"
- Drag the "Data" node ==> Open/Send Automatically
- Drag the Frequent Itemsets node
- Drag the Association Rules node
!pip install mlxtend
Installing collected packages: mlxtend
Successfully installed mlxtend-0.23.0
import warnings; warnings.simplefilter('ignore')
import pandas as pd, matplotlib.pyplot as plt, seaborn as sns
from itertools import combinations
from collections import Counter
from mlxtend.frequent_patterns import apriori
from mlxtend.frequent_patterns import association_rules
#from pycaret.arules import *
%matplotlib inline
plt.style.use('bmh'); sns.set()
# In Python
T = [
('Bread', 'Milk'),
('Beer', 'Bread', 'Diaper', 'Eggs', 'Milk', 'Bread', 'Milk', 'Milk'),
('Beer', 'Coke', 'Diaper', 'Milk'),
('Beer', 'Bread', 'Diaper', 'Milk'),
('Bread', 'Coke', 'Diaper', 'Milk', 'Diaper'),
]
T
[('Bread', 'Milk'),
('Beer', 'Bread', 'Diaper', 'Eggs', 'Milk', 'Bread', 'Milk', 'Milk'),
('Beer', 'Coke', 'Diaper', 'Milk'),
('Beer', 'Bread', 'Diaper', 'Milk'),
('Bread', 'Coke', 'Diaper', 'Milk', 'Diaper')]
# Calculating itemsets
# Discrete mathematics nostalgia :)
def subsets(S, k):
    """Return all k-element subsets of S as a list of sets."""
    return [set(s) for s in combinations(S, k)]
subsets({1, 2, 3, 7, 8}, 2)
[{1, 2},
{1, 3},
{1, 7},
{1, 8},
{2, 3},
{2, 7},
{2, 8},
{3, 7},
{3, 8},
{7, 8}]
# Calculating support
Counter(T[1])
Counter({'Beer': 1, 'Bread': 2, 'Diaper': 1, 'Eggs': 1, 'Milk': 3})
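Note that `Counter(T[1])` counts repeated items inside a single transaction, which is not the same thing as support. A minimal sketch of counting support over all transactions in `T`, assuming each transaction should contribute at most once per item, is to deduplicate each tuple with `set()` first. The variable names below are illustrative.

```python
from collections import Counter

T = [
    ('Bread', 'Milk'),
    ('Beer', 'Bread', 'Diaper', 'Eggs', 'Milk', 'Bread', 'Milk', 'Milk'),
    ('Beer', 'Coke', 'Diaper', 'Milk'),
    ('Beer', 'Bread', 'Diaper', 'Milk'),
    ('Bread', 'Coke', 'Diaper', 'Milk', 'Diaper'),
]

# Naive count: repeated items inflate the totals.
naive_counts = Counter(item for t in T for item in t)

# Support count: each transaction is reduced to a set before counting.
item_counts = Counter(item for t in T for item in set(t))
support = {item: count / len(T) for item, count in item_counts.items()}

for item in sorted(support):
    print(f"{item}: naive={naive_counts[item]}, "
          f"transactions={item_counts[item]}, support={support[item]:.1f}")
```

For Milk the naive count is 7, although it can appear in at most 5 transactions; after deduplication its support is 5/5 = 1.0.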
# Using a module
# Taken from https://pbpython.com/market-basket-analysis.html
# First, load the data
try:
    df = pd.read_csv('data/Online_Retail.csv')
except FileNotFoundError:
    # No local copy yet: fall back to the original Excel file
    df = pd.read_excel('http://archive.ics.uci.edu/ml/machine-learning-databases/00352/Online%20Retail.xlsx')
df.head(10)
| | InvoiceNo | StockCode | Description | Quantity | InvoiceDate | UnitPrice | CustomerID | Country |
|---|---|---|---|---|---|---|---|---|
| 0 | 536365 | 85123A | WHITE HANGING HEART T-LIGHT HOLDER | 6 | 2010-12-01 08:26:00 | 2.55 | 17850.0 | United Kingdom |
| 1 | 536365 | 71053 | WHITE METAL LANTERN | 6 | 2010-12-01 08:26:00 | 3.39 | 17850.0 | United Kingdom |
| 2 | 536365 | 84406B | CREAM CUPID HEARTS COAT HANGER | 8 | 2010-12-01 08:26:00 | 2.75 | 17850.0 | United Kingdom |
| 3 | 536365 | 84029G | KNITTED UNION FLAG HOT WATER BOTTLE | 6 | 2010-12-01 08:26:00 | 3.39 | 17850.0 | United Kingdom |
| 4 | 536365 | 84029E | RED WOOLLY HOTTIE WHITE HEART. | 6 | 2010-12-01 08:26:00 | 3.39 | 17850.0 | United Kingdom |
| 5 | 536365 | 22752 | SET 7 BABUSHKA NESTING BOXES | 2 | 2010-12-01 08:26:00 | 7.65 | 17850.0 | United Kingdom |
| 6 | 536365 | 21730 | GLASS STAR FROSTED T-LIGHT HOLDER | 6 | 2010-12-01 08:26:00 | 4.25 | 17850.0 | United Kingdom |
| 7 | 536366 | 22633 | HAND WARMER UNION JACK | 6 | 2010-12-01 08:28:00 | 1.85 | 17850.0 | United Kingdom |
| 8 | 536366 | 22632 | HAND WARMER RED POLKA DOT | 6 | 2010-12-01 08:28:00 | 1.85 | 17850.0 | United Kingdom |
| 9 | 536367 | 84879 | ASSORTED COLOUR BIRD ORNAMENT | 32 | 2010-12-01 08:34:00 | 1.69 | 13047.0 | United Kingdom |
# Preprocessing
df['Description'] = df['Description'].str.strip() # remove unnecessary spaces
df['Description'] = df['Description'].str.lower() # lower case normalization
df.dropna(axis=0, subset=['InvoiceNo'], inplace=True) # delete rows with no invoice no
df['InvoiceNo'] = df['InvoiceNo'].astype('str') # Change data type
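df = df[~df['InvoiceNo'].str.contains('c')] # meant to drop cancelled invoices; the match is case-sensitive, so invoices prefixed with an uppercase 'C' are kept (use 'C' to drop them)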
df.head()
| | InvoiceNo | StockCode | Description | Quantity | InvoiceDate | UnitPrice | CustomerID | Country |
|---|---|---|---|---|---|---|---|---|
| 0 | 536365 | 85123A | white hanging heart t-light holder | 6 | 2010-12-01 08:26:00 | 2.55 | 17850.0 | United Kingdom |
| 1 | 536365 | 71053 | white metal lantern | 6 | 2010-12-01 08:26:00 | 3.39 | 17850.0 | United Kingdom |
| 2 | 536365 | 84406B | cream cupid hearts coat hanger | 8 | 2010-12-01 08:26:00 | 2.75 | 17850.0 | United Kingdom |
| 3 | 536365 | 84029G | knitted union flag hot water bottle | 6 | 2010-12-01 08:26:00 | 3.39 | 17850.0 | United Kingdom |
| 4 | 536365 | 84029E | red woolly hottie white heart. | 6 | 2010-12-01 08:26:00 | 3.39 | 17850.0 | United Kingdom |
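The invoice filter in the preprocessing step relies on substring matching, which is case-sensitive in Python (and, by default, in pandas `str.contains`). The plain-Python sketch below shows the effect on a few invoice numbers that appear in this notebook's outputs. It assumes, as those outputs suggest, that cancelled invoices carry an uppercase `C` prefix.

```python
# Invoice numbers taken from the outputs shown in this notebook.
invoice_numbers = ["536365", "536389", "C560540", "C574344"]

# Lowercase pattern: never matches, so nothing is removed.
kept_lowercase = [inv for inv in invoice_numbers if "c" not in inv]

# Uppercase prefix test: removes the cancelled invoices.
kept_uppercase = [inv for inv in invoice_numbers if not inv.startswith("C")]

print(kept_lowercase)
print(kept_uppercase)
```

Only the second filter drops the `C...` invoices, which carry negative quantities and would otherwise end up as all-zero rows in the basket matrix.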
df.to_csv("data/Online_Retail.csv", encoding='utf8', index=False)
'Done'
'Done'
filter_ = {'pls', 'plas'}
for f in filter_:
    df = df[~df['InvoiceNo'].str.contains(f)] # filtering invoice
print(set(df['Country']))
{'Singapore', 'United Arab Emirates', 'European Community', 'Sweden', 'Malta', 'Norway', 'United Kingdom', 'Switzerland', 'Unspecified', 'Japan', 'USA', 'France', 'Spain', 'Cyprus', 'EIRE', 'Italy', 'RSA', 'Austria', 'Israel', 'Channel Islands', 'Netherlands', 'Saudi Arabia', 'Belgium', 'Australia', 'Hong Kong', 'Lithuania', 'Portugal', 'Finland', 'Germany', 'Brazil', 'Bahrain', 'Lebanon', 'Iceland', 'Czech Republic', 'Canada', 'Denmark', 'Greece', 'Poland'}
df_A = df[df['Country'] =="Australia"]
df_A.head()
| | InvoiceNo | StockCode | Description | Quantity | InvoiceDate | UnitPrice | CustomerID | Country |
|---|---|---|---|---|---|---|---|---|
| 197 | 536389 | 22941 | christmas lights 10 reindeer | 6 | 2010-12-01 10:03:00 | 8.50 | 12431.0 | Australia |
| 198 | 536389 | 21622 | vintage union jack cushion cover | 8 | 2010-12-01 10:03:00 | 4.95 | 12431.0 | Australia |
| 199 | 536389 | 21791 | vintage heads and tails card game | 12 | 2010-12-01 10:03:00 | 1.25 | 12431.0 | Australia |
| 200 | 536389 | 35004C | set of 3 coloured flying ducks | 6 | 2010-12-01 10:03:00 | 5.45 | 12431.0 | Australia |
| 201 | 536389 | 35004G | set of 3 gold flying ducks | 4 | 2010-12-01 10:03:00 | 6.35 | 12431.0 | Australia |
type(df_A)
pandas.core.frame.DataFrame
# Let's sample the data
basket = df[df['Country'] =="Australia"]
basket.head()
| | InvoiceNo | StockCode | Description | Quantity | InvoiceDate | UnitPrice | CustomerID | Country |
|---|---|---|---|---|---|---|---|---|
| 197 | 536389 | 22941 | christmas lights 10 reindeer | 6 | 2010-12-01 10:03:00 | 8.50 | 12431.0 | Australia |
| 198 | 536389 | 21622 | vintage union jack cushion cover | 8 | 2010-12-01 10:03:00 | 4.95 | 12431.0 | Australia |
| 199 | 536389 | 21791 | vintage heads and tails card game | 12 | 2010-12-01 10:03:00 | 1.25 | 12431.0 | Australia |
| 200 | 536389 | 35004C | set of 3 coloured flying ducks | 6 | 2010-12-01 10:03:00 | 5.45 | 12431.0 | Australia |
| 201 | 536389 | 35004G | set of 3 gold flying ducks | 4 | 2010-12-01 10:03:00 | 6.35 | 12431.0 | Australia |
# Group the transaction
basket = basket.groupby(['InvoiceNo', 'Description'])['Quantity']
basket.head()
197 6
198 8
199 12
200 6
201 4
..
497681 20
497682 24
497683 20
497684 12
497685 12
Name: Quantity, Length: 1259, dtype: int64
basket.sum().unstack()
| Description | 10 colour spaceboy pen | 12 pencil small tube woodland | 12 pencils tall tube posy | 12 pencils tall tube red retrospot | 16 piece cutlery set pantry design | 20 dolly pegs retrospot | 3 hook hanger magic garden | 3 stripey mice feltcraft | 3 tier cake tin green and cream | 3 tier cake tin red and cream | ... | wrap doiley design | wrap dolly girl | wrap english rose | wrap i love london | wrap poppies design | wrap red apples | wrap red vintage doily | wrap vintage leaf design | wrap wedding day | yellow giant garden thermometer |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| InvoiceNo | |||||||||||||||||||||
| 536389 | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | ... | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
| 537676 | NaN | NaN | NaN | NaN | NaN | 24.0 | NaN | NaN | NaN | NaN | ... | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
| 539419 | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | ... | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
| 540267 | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | ... | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
| 540280 | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | ... | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| C560540 | NaN | NaN | NaN | NaN | NaN | NaN | NaN | -1.0 | NaN | NaN | ... | NaN | NaN | NaN | NaN | NaN | NaN | NaN | -1.0 | NaN | NaN |
| C561227 | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | ... | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
| C568694 | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | ... | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
| C574019 | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | ... | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
| C574344 | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | ... | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
69 rows × 609 columns
# Sum the quantities, unstack, replace NaN with 0, and use the invoice number as the row index
basket = basket.sum().unstack().reset_index().fillna(0).set_index('InvoiceNo')
basket.head()
| Description | 10 colour spaceboy pen | 12 pencil small tube woodland | 12 pencils tall tube posy | 12 pencils tall tube red retrospot | 16 piece cutlery set pantry design | 20 dolly pegs retrospot | 3 hook hanger magic garden | 3 stripey mice feltcraft | 3 tier cake tin green and cream | 3 tier cake tin red and cream | ... | wrap doiley design | wrap dolly girl | wrap english rose | wrap i love london | wrap poppies design | wrap red apples | wrap red vintage doily | wrap vintage leaf design | wrap wedding day | yellow giant garden thermometer |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| InvoiceNo | |||||||||||||||||||||
| 536389 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | ... | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 |
| 537676 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 24.0 | 0.0 | 0.0 | 0.0 | 0.0 | ... | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 |
| 539419 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | ... | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 |
| 540267 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | ... | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 |
| 540280 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | ... | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 |
5 rows × 609 columns
def encode_units(x):
    """Map a summed quantity to a 0/1 purchase indicator."""
    return 1 if x >= 1 else 0
basket_sets = basket.applymap(encode_units) # one-hot encoding
basket_sets.head()
| Description | 10 colour spaceboy pen | 12 pencil small tube woodland | 12 pencils tall tube posy | 12 pencils tall tube red retrospot | 16 piece cutlery set pantry design | 20 dolly pegs retrospot | 3 hook hanger magic garden | 3 stripey mice feltcraft | 3 tier cake tin green and cream | 3 tier cake tin red and cream | ... | wrap doiley design | wrap dolly girl | wrap english rose | wrap i love london | wrap poppies design | wrap red apples | wrap red vintage doily | wrap vintage leaf design | wrap wedding day | yellow giant garden thermometer |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| InvoiceNo | |||||||||||||||||||||
| 536389 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| 537676 | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| 539419 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| 540267 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| 540280 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
5 rows × 609 columns
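The pivot-and-encode steps can also be expressed with a vectorized comparison that yields a boolean table directly, which is the input type mlxtend recommends. The sketch below uses a tiny invented DataFrame (the invoice numbers, descriptions, and quantities are made up for illustration) and assumes pandas is installed.

```python
import pandas as pd

# Toy line items: one row per (invoice, product) purchase record.
toy = pd.DataFrame({
    "InvoiceNo":   ["A1", "A1", "A2", "A2", "A3"],
    "Description": ["pen", "cup", "pen", "pen", "cup"],
    "Quantity":    [2, 1, 3, 1, -1],
})

# Sum quantities per invoice and product, then pivot products into columns.
basket_toy = (
    toy.groupby(["InvoiceNo", "Description"])["Quantity"]
       .sum()
       .unstack()
       .fillna(0)
)

# True where the invoice contains a positive quantity of the product.
basket_bool = basket_toy > 0
print(basket_bool)
```

The negative quantity on invoice `A3` becomes `False`, matching what `encode_units` does for values less than or equal to zero.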
Understanding the Data Structure
basket_sets.columns
Index(['10 colour spaceboy pen', '12 pencil small tube woodland',
'12 pencils tall tube posy', '12 pencils tall tube red retrospot',
'16 piece cutlery set pantry design', '20 dolly pegs retrospot',
'3 hook hanger magic garden', '3 stripey mice feltcraft',
'3 tier cake tin green and cream', '3 tier cake tin red and cream',
...
'wrap doiley design', 'wrap dolly girl', 'wrap english rose',
'wrap i love london', 'wrap poppies design', 'wrap red apples',
'wrap red vintage doily', 'wrap vintage leaf design',
'wrap wedding day', 'yellow giant garden thermometer'],
dtype='object', name='Description', length=609)
basket_sets.index
Index(['536389', '537676', '539419', '540267', '540280', '540557', '540700',
'541149', '541271', '541520', '541657', '542542', '543357', '543372',
'543376', '543989', '545065', '545475', '546135', '547659', '548661',
'549313', '552956', '553546', '554037', '554126', '556917', '556918',
'558536', '558537', '559919', '559920', '560033', '560473', '560491',
'561040', '561228', '563179', '563614', '565145', '565146', '565466',
'567085', '568145', '568687', '568695', '568708', '569647', '569650',
'569722', '569723', '574014', '574138', '574469', '576394', '576586',
'578459', 'C538723', 'C543375', 'C545525', 'C548729', 'C551348',
'C555046', 'C555288', 'C560540', 'C561227', 'C568694', 'C574019',
'C574344'],
dtype='object', name='InvoiceNo')
basket_sets.iloc[0]
Description
10 colour spaceboy pen 0
12 pencil small tube woodland 0
12 pencils tall tube posy 0
12 pencils tall tube red retrospot 0
16 piece cutlery set pantry design 0
..
wrap red apples 0
wrap red vintage doily 0
wrap vintage leaf design 0
wrap wedding day 0
yellow giant garden thermometer 0
Name: 536389, Length: 609, dtype: int64
basket_sets.loc['553546'].sum()
73
frequent_itemsets = apriori(basket_sets, min_support=0.07, use_colnames=True)
frequent_itemsets.sort_values(by='support', ascending=False, na_position='last', inplace = True)
frequent_itemsets
C:\Users\taufi\anaconda\envs\Teaching\lib\site-packages\mlxtend\frequent_patterns\fpcommon.py:110: DeprecationWarning: DataFrames with non-bool types result in worse computational performance and their support might be discontinued in the future. Please use a DataFrame with bool type
warnings.warn(
| | support | itemsets |
|---|---|---|
| 33 | 0.130435 | (set of 3 cake tins pantry design) |
| 28 | 0.130435 | (red toadstool led night light) |
| 31 | 0.115942 | (roses regency teacup and saucer) |
| 15 | 0.115942 | (lunch bag red retrospot) |
| 4 | 0.115942 | (baking set spaceboy design) |
| ... | ... | ... |
| 11 | 0.072464 | (homemade jam scented candles) |
| 7 | 0.072464 | (circus parade lunch box) |
| 6 | 0.072464 | (blue happy birthday bunting) |
| 5 | 0.072464 | (black/blue polkadot umbrella) |
| 61 | 0.072464 | (roses regency teacup and saucer, spaceboy lun... |
62 rows × 2 columns
type(frequent_itemsets)
pandas.core.frame.DataFrame
rules = association_rules(frequent_itemsets, metric="lift", min_threshold=1)
rules.sort_values(by='lift', ascending=False, na_position='last', inplace = True)
rules.head(5)
| | antecedents | consequents | antecedent support | consequent support | support | confidence | lift | leverage | conviction | zhangs_metric |
|---|---|---|---|---|---|---|---|---|---|---|
| 73 | (dolly girl lunch box, regency cakestand 3 tier) | (roses regency teacup and saucer, spaceboy lun... | 0.072464 | 0.072464 | 0.072464 | 1.0 | 13.8 | 0.067213 | inf | 1.0 |
| 72 | (spaceboy lunch box, regency cakestand 3 tier) | (roses regency teacup and saucer, dolly girl l... | 0.072464 | 0.072464 | 0.072464 | 1.0 | 13.8 | 0.067213 | inf | 1.0 |
| 69 | (roses regency teacup and saucer, dolly girl l... | (spaceboy lunch box, regency cakestand 3 tier) | 0.072464 | 0.072464 | 0.072464 | 1.0 | 13.8 | 0.067213 | inf | 1.0 |
| 68 | (roses regency teacup and saucer, spaceboy lun... | (dolly girl lunch box, regency cakestand 3 tier) | 0.072464 | 0.072464 | 0.072464 | 1.0 | 13.8 | 0.067213 | inf | 1.0 |
| 0 | (spaceboy lunch box) | (dolly girl lunch box) | 0.086957 | 0.086957 | 0.086957 | 1.0 | 11.5 | 0.079395 | inf | 1.0 |
type(rules)
pandas.core.frame.DataFrame
rules.shape
(78, 10)
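To see what the rule-generation step does conceptually, the sketch below derives rules from a dictionary of frequent itemsets and their supports, keeping rules with lift of at least 1. It is a simplified illustration, not mlxtend's implementation: the function name `generate_rules` is hypothetical, and the supports are those of the five toy transactions from the theory section at a minimum support of 0.6.

```python
from fractions import Fraction
from itertools import combinations

def generate_rules(supports, min_lift=1):
    """Split each frequent itemset into antecedent => consequent rules."""
    rules = []
    for itemset, s in supports.items():
        for r in range(1, len(itemset)):  # single items yield no rules
            for combo in combinations(sorted(itemset), r):
                antecedent = frozenset(combo)
                consequent = itemset - antecedent
                confidence = s / supports[antecedent]
                lift = confidence / supports[consequent]
                if lift >= min_lift:
                    rules.append((antecedent, consequent, confidence, lift))
    return sorted(rules, key=lambda rule: rule[3], reverse=True)

supports = {
    frozenset({"Beer"}): Fraction(3, 5),
    frozenset({"Bread"}): Fraction(4, 5),
    frozenset({"Diaper"}): Fraction(4, 5),
    frozenset({"Milk"}): Fraction(4, 5),
    frozenset({"Beer", "Diaper"}): Fraction(3, 5),
    frozenset({"Bread", "Diaper"}): Fraction(3, 5),
    frozenset({"Bread", "Milk"}): Fraction(3, 5),
    frozenset({"Diaper", "Milk"}): Fraction(3, 5),
}

rules_toy = generate_rules(supports)
for antecedent, consequent, confidence, lift in rules_toy:
    print(sorted(antecedent), "=>", sorted(consequent), confidence, lift)
```

Only the Beer/Diaper pair survives the lift threshold; the other pairs have lift 15/16, slightly below what independence would give.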
# Filtering
rules[ (rules['lift'] >= 10) & (rules['confidence'] >= 0.9) ]
| | antecedents | consequents | antecedent support | consequent support | support | confidence | lift | leverage | conviction | zhangs_metric |
|---|---|---|---|---|---|---|---|---|---|---|
| 73 | (dolly girl lunch box, regency cakestand 3 tier) | (roses regency teacup and saucer, spaceboy lun... | 0.072464 | 0.072464 | 0.072464 | 1.0 | 13.8 | 0.067213 | inf | 1.000000 |
| 72 | (spaceboy lunch box, regency cakestand 3 tier) | (roses regency teacup and saucer, dolly girl l... | 0.072464 | 0.072464 | 0.072464 | 1.0 | 13.8 | 0.067213 | inf | 1.000000 |
| 69 | (roses regency teacup and saucer, dolly girl l... | (spaceboy lunch box, regency cakestand 3 tier) | 0.072464 | 0.072464 | 0.072464 | 1.0 | 13.8 | 0.067213 | inf | 1.000000 |
| 68 | (roses regency teacup and saucer, spaceboy lun... | (dolly girl lunch box, regency cakestand 3 tier) | 0.072464 | 0.072464 | 0.072464 | 1.0 | 13.8 | 0.067213 | inf | 1.000000 |
| 0 | (spaceboy lunch box) | (dolly girl lunch box) | 0.086957 | 0.086957 | 0.086957 | 1.0 | 11.5 | 0.079395 | inf | 1.000000 |
| 38 | (spaceboy lunch box, circus parade lunch box) | (dolly girl lunch box) | 0.072464 | 0.086957 | 0.072464 | 1.0 | 11.5 | 0.066163 | inf | 0.984375 |
| 1 | (dolly girl lunch box) | (spaceboy lunch box) | 0.086957 | 0.086957 | 0.086957 | 1.0 | 11.5 | 0.079395 | inf | 1.000000 |
| 40 | (circus parade lunch box, dolly girl lunch box) | (spaceboy lunch box) | 0.072464 | 0.086957 | 0.072464 | 1.0 | 11.5 | 0.066163 | inf | 0.984375 |
| 42 | (circus parade lunch box) | (spaceboy lunch box, dolly girl lunch box) | 0.072464 | 0.086957 | 0.072464 | 1.0 | 11.5 | 0.066163 | inf | 0.984375 |
| 44 | (roses regency teacup and saucer, spaceboy lun... | (dolly girl lunch box) | 0.072464 | 0.086957 | 0.072464 | 1.0 | 11.5 | 0.066163 | inf | 0.984375 |
| 45 | (roses regency teacup and saucer, dolly girl l... | (spaceboy lunch box) | 0.072464 | 0.086957 | 0.072464 | 1.0 | 11.5 | 0.066163 | inf | 0.984375 |
| 56 | (circus parade lunch box) | (dolly girl lunch box) | 0.072464 | 0.086957 | 0.072464 | 1.0 | 11.5 | 0.066163 | inf | 0.984375 |
| 55 | (circus parade lunch box) | (spaceboy lunch box) | 0.072464 | 0.086957 | 0.072464 | 1.0 | 11.5 | 0.066163 | inf | 0.984375 |
| 29 | (roses regency teacup and saucer, regency cake... | (spaceboy lunch box) | 0.072464 | 0.086957 | 0.072464 | 1.0 | 11.5 | 0.066163 | inf | 0.984375 |
| 64 | (roses regency teacup and saucer, dolly girl l... | (regency cakestand 3 tier) | 0.072464 | 0.086957 | 0.072464 | 1.0 | 11.5 | 0.066163 | inf | 0.984375 |
| 65 | (roses regency teacup and saucer, regency cake... | (dolly girl lunch box) | 0.072464 | 0.086957 | 0.072464 | 1.0 | 11.5 | 0.066163 | inf | 0.984375 |
| 66 | (roses regency teacup and saucer, dolly girl l... | (spaceboy lunch box) | 0.072464 | 0.086957 | 0.072464 | 1.0 | 11.5 | 0.066163 | inf | 0.984375 |
| 70 | (roses regency teacup and saucer, regency cake... | (spaceboy lunch box, dolly girl lunch box) | 0.072464 | 0.086957 | 0.072464 | 1.0 | 11.5 | 0.066163 | inf | 0.984375 |
| 11 | (roses regency teacup and saucer, regency cake... | (dolly girl lunch box) | 0.072464 | 0.086957 | 0.072464 | 1.0 | 11.5 | 0.066163 | inf | 0.984375 |
| 10 | (roses regency teacup and saucer, dolly girl l... | (regency cakestand 3 tier) | 0.072464 | 0.086957 | 0.072464 | 1.0 | 11.5 | 0.066163 | inf | 0.984375 |
| 22 | (dolly girl lunch box, regency cakestand 3 tier) | (spaceboy lunch box) | 0.072464 | 0.086957 | 0.072464 | 1.0 | 11.5 | 0.066163 | inf | 0.984375 |
| 21 | (spaceboy lunch box, regency cakestand 3 tier) | (dolly girl lunch box) | 0.072464 | 0.086957 | 0.072464 | 1.0 | 11.5 | 0.066163 | inf | 0.984375 |
| 3 | (alarm clock bakelike green) | (alarm clock bakelike red) | 0.086957 | 0.086957 | 0.086957 | 1.0 | 11.5 | 0.079395 | inf | 1.000000 |
| 2 | (alarm clock bakelike red) | (alarm clock bakelike green) | 0.086957 | 0.086957 | 0.086957 | 1.0 | 11.5 | 0.079395 | inf | 1.000000 |
| 28 | (roses regency teacup and saucer, spaceboy lun... | (regency cakestand 3 tier) | 0.072464 | 0.086957 | 0.072464 | 1.0 | 11.5 | 0.066163 | inf | 0.984375 |